Take Home Exercise 3

Author

Joshua TING

Published

May 18, 2024

Modified

June 20, 2024

1 1. Introduction - Mini-Challenge 3

Setting the Scene

The business community in Oceanus is dynamic with new startups, mergers, acquisitions, and investments. FishEye International closely watches business records to keep tabs on commercial fishing operators. FishEye’s goal is to identify and prevent illegal fishing in the region’s sensitive marine ecosystem. Analysts are working with company records that show ownership, shareholders, transactions, and information about the typical products and services of each entity. FishEye’s analysts have a hybrid automated/manual process to transform the data into CatchNet: the Oceanus Knowledge Graph.

In the past year, Oceanus’s commercial fishing business community was rocked by the news that SouthSeafood Express Corp was caught fishing illegally. FishEye wants to understand temporal patterns and infer what may be happening in Oceanus’s fishing marketplace because of SouthSeafood Express Corp’s illegal behavior and eventual closure. The competitive nature of Oceanus’s fishing market may cause some businesses to react aggressively to capture SouthSeafood Express Corp’s business while other reactions may come from the awareness that illegal fishing does not go undetected and unpunished.

Question 1: FishEye analysts want to better visualize changes in corporate structures over time. Create a visual analytics approach that analysts can use to highlight temporal patterns and changes in corporate structures. Examine the most active people and businesses using visual analytics.

Question 4: Identify the network associated with SouthSeafood Express Corp and visualize how this network and competing businesses change as a result of their illegal fishing behavior. Which companies benefited from SouthSeafood Express Corp legal troubles? Are there other suspicious transactions that may be related to illegal fishing? Provide visual evidence for your conclusions.

2 1. Loading Packages

Code Chunk
pacman::p_load(jsonlite, tidygraph, ggraph, kableExtra,
               visNetwork, graphlayouts, ggforce, textstem, 
               skimr, tidytext, tidyverse, gganimate,dplyr, lubridate, DT)

3 2. Data Preparation

4 Data Import

In the code chunk below, fromJSON() of jsonlite package is used to import MC3.json into R environment.

Code Chunk
mc3_data <- fromJSON("data/MC3_/mc3.json")

4.1 Extracting edges

The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.

Code Chunk
mc3_edges <- as_tibble(mc3_data$links) %>% 
  unnest(source) %>% 
  distinct() %>% 
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type),
         startdate = as_datetime(start_date)) %>%
  group_by(source, target, type, startdate) %>% 
  summarise(weights = n()) %>% 
  filter(source != target) %>%
  ungroup()

head(mc3_edges)
# A tibble: 6 × 5
  source                 target                type  startdate           weights
  <chr>                  <chr>                 <chr> <dttm>                <int>
1 4. SeaCargo Ges.m.b.H. Dry CreekRybachit Ma… Even… 2034-12-31 00:00:00       1
2 4. SeaCargo Ges.m.b.H. KambalaSea Freight I… Even… 2033-04-12 00:00:00       1
3 9. RiverLine CJSC      SumacAmerica Transpo… Even… 2028-12-02 00:00:00       1
4 Aaron Acosta           Manning-Pratt         Even… 2008-09-14 00:00:00       1
5 Aaron Acosta           Manning-Pratt         Even… 2008-07-30 00:00:00       1
6 Aaron Allen            Hicks-Calderon        Even… 2025-03-06 00:00:00       1

Splitting Words

Code Chunk
word_list <- strsplit(mc3_edges$type, "\\.")
Code Chunk
max_elements <- max(lengths(word_list))
Code Chunk
word_list_padded <- lapply(word_list, function(x) c(x, rep(NA, max_elements - length(x))))
Code Chunk
word_df <- do.call(rbind, word_list_padded)
colnames(word_df) <- paste0("event", 1:max_elements)
Code Chunk
word_df <- as_tibble(word_df) %>%
  select(event2, event3)
class(word_df)
[1] "tbl_df"     "tbl"        "data.frame"
Code Chunk
mc3_edges <- mc3_edges %>%
  cbind(word_df)
Note

Learning from Code Chunk Above

  • distinct() is used to ensure that there will be no duplicated records.

  • mutate() and as.character() are used to convert the field data type from list to character.

  • group_by() and summarise() are used to count the number of unique links.

  • the filter(source!=target) is to ensure that no record with similar source and target.

4.2 Extracting nodes

The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.

Code Chunk
# extract all nodes from graph
mc3_nodes <- as_tibble(mc3_data$nodes) %>% 
  mutate(country = as.character(country),
         id = as.character(id),
         revenue = as.numeric(as.character(revenue)),
         type = as.character(type)) %>%
  select(id, country, type, revenue)

# extract all nodes from edges
id1 <- mc3_edges %>%
  select(source, type) %>%
  rename(id = source) %>% 
  mutate(country = NA, revenue = NA) %>% 
  select(id, country, type, revenue)

id2 <- mc3_edges %>%
  select(target, type) %>%
  rename(id = target) %>% 
  mutate(country = NA, revenue = NA) %>% 
  select(id, country, type, revenue)

additional_nodes <- rbind(id1, id2) %>% 
  distinct %>% 
  filter(!id %in% mc3_nodes[["id"]])

# combine all nodes
mc3_nodes_updated <- rbind(mc3_nodes, additional_nodes) %>%
  distinct()

head(mc3_nodes_updated)
# A tibble: 6 × 4
  id                          country     type                        revenue
  <chr>                       <chr>       <chr>                         <dbl>
1 Abbott, Mcbride and Edwards Uziland     Entity.Organization.Company   5995.
2 Abbott-Gomez                Mawalara    Entity.Organization.Company  71767.
3 Abbott-Harrison             Uzifrica    Entity.Organization.Company      0 
4 Abbott-Ibarra               Islavaragon Entity.Organization.Company      0 
5 Abbott-Sullivan             Oceanus     Entity.Organization.Company   4747.
6 Acevedo and Sons            Imazam      Entity.Organization.Company  46567.
Note

Learning From Code Chunk Above

  • mutate() and as.character() are used to convert the field data type from list to character.

  • To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.

  • select() is used to re-organise the order of the fields.

5 3. Data Exploration

5.1 Exploring Edges DF

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.

Code Chunk
skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 75815
Number of columns 7
_______________________
Column type frequency:
character 5
numeric 1
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.0 6 42 0 51996 0
target 0 1.0 6 48 0 8926 0
type 0 1.0 14 31 0 4 0
event2 0 1.0 4 18 0 3 0
event3 14908 0.8 15 19 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0.01 1 1 1 1 2 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
startdate 90 1 1952-05-31 2035-12-29 2023-12-12 11468

In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.

Code Chunk
DT::datatable(mc3_edges)
Code Chunk
ggplot(data = mc3_edges,
       aes(x = type)) +
  geom_bar()

In the above graph, it shows the counts of people that either is a beneficial owner, shareholder, employee or family member of the organisation. A substantial of the graph belongs to the shareholders.

6 Initial Network Visualisation and Analysis

6.1 Building network model with tidygraph

Code Chunk
mc3_graph <- tbl_graph(nodes = mc3_nodes_updated,
                       edges = mc3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

5.1 Tidying Nodes

Code Chunk
mc3_nodes1 <- mc3_nodes_updated %>%
  mutate(type = gsub("[(]", "", type)) %>% 
  mutate(type = gsub("\"", "", type)) %>%
  mutate(type = gsub("[)]", "", type)) 
Code Chunk
graph <- tbl_graph(nodes = mc3_nodes_updated, edges = mc3_edges, directed = TRUE)
Code Chunk
period1_start <- as_date("2023-01-01")
period1_end <- as_date("2023-06-30")
period2_start <- as_date("2023-07-01")
period2_end <- as_date("2023-12-31")
Code Chunk
# Create graph objects for each period
graph_period1 <- tbl_graph(nodes = mc3_nodes_updated, edges = mc3_edges, directed = TRUE)
graph_period2 <- tbl_graph(nodes = mc3_nodes_updated, edges = mc3_edges, directed = TRUE)
Code Chunk
# Plot for period 1
plot1 <- ggraph(mc3_graph, layout = 'fr') +
  geom_edge_link() +
  geom_node_point() +
  theme_graph() +
  ggtitle("Plot 1")
Code Chunk
set.seed(123)
mc3_graph %>%
  filter(betweenness_centrality >= 1000000) %>%
ggraph(layout = "fr") +
  geom_edge_link(aes(#width= weights,
                     alpha=0.5)) +
  geom_node_point(aes(
    size = betweenness_centrality,
    color = type,
    alpha = 0.3)) +
  geom_node_label(aes(label = id),repel=TRUE, size=2.5, alpha = 0.8) +
  scale_size_continuous(range=c(1,10)) +
  theme_graph() +
  labs(title = 'Initial network visualisation',
       subtitle = 'Entities with betweenness scores > 1,000,000')